RAKE Experiment 1
code::
input:
I was looking at the Wikipedia dump data, and the link notation is a nice one token in SentencePiece, so I think I can use this. I'll have to do a lot of preprocessing, though.
SentencePiece / 100 document stoplist
code::
output:
Link notation is sentencepiece: 81.00
wikipedia: 25.00
View dump data: 16.00
Becoming 1 token: 16.00
Various: 4.00
Pretreatment: 4.00
Also good: 4.00
Likely to be able to: 4.00
"Link notation is S ent ence P ie ce".
"W i ki pe dia"
"Wikipedia" is split into 5 tokens, yet it is correctly bundled back into a single keyword.
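This re-bundling is what RAKE does: runs of tokens that are not in the stop list become candidate phrases, and each phrase is scored as the sum of degree(w)/freq(w) over its words, so a phrase of n tokens that each occur only once scores n × n (5 × 5 = 25 for the 5-token "wikipedia" above). Below is a minimal sketch of that scoring, assuming the standard RAKE formulation; the tokenizer and stop list are supplied from outside and are not the exact ones used here.
code::
from collections import defaultdict

def rake_scores(token_lists, stopwords):
    """Minimal RAKE: runs of non-stopword tokens form candidate phrases,
    each scored as sum(degree(w) / freq(w)) over its words."""
    # 1. Split each tokenized document on stopwords to get candidate phrases.
    phrases = []
    for tokens in token_lists:
        current = []
        for t in tokens:
            if t in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(t)
        if current:
            phrases.append(tuple(current))

    # 2. freq(w): occurrences of w; degree(w): total length of phrases containing w.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)

    # 3. Phrase score: sum of the word scores degree(w) / freq(w).
    return {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
For example, rake_scores([["W", "i", "ki", "pe", "dia"]], stopwords=set()) gives that single 5-token phrase a score of 25.0, matching the output above.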
The "law" is omitted from the stop list because of one token.
The stop list was built from only about 100 Wikipedia pages, so more tokens could probably be added to it.
In the first place I want "link notation" to come out as a keyword, so SentencePiece's way of segmenting the text may be a poor fit.
The tokens may also be too large, because the model is the vocab_size: 32000 one that was used for the Japanese BERT model.
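For reference, a sketch of how a frequency-based stop list like this could be built with the sentencepiece library, assuming the stop list is simply the most document-frequent tokens over N Wikipedia pages; the model file name and the top-300 cutoff are made-up placeholders, not the actual settings.
code::
import collections
import sentencepiece as spm

# Placeholder model file; the text only says a vocab_size: 32000 model
# made for the Japanese BERT model was reused.
sp = spm.SentencePieceProcessor(model_file="ja_bert_vocab32000.model")

def build_stoplist(documents, top_n=300):
    """Assumed construction: take the top_n tokens by document frequency."""
    df = collections.Counter()
    for doc in documents:
        df.update(set(sp.encode(doc, out_type=str)))
    return {token for token, _ in df.most_common(top_n)}

# e.g. stopwords = build_stoplist(wikipedia_pages[:100]) for the 100-document run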
Mecab / 1000 document stoplist
code::
Not good: 7.50
Not done: 4.50
Pretreatment: 4.00
Link notation: 4.00
So here it is: 4.00
Likely to be able to: 4.00
1 token: 4.00
In the name of: 3.50
but: 2.00
Hands: 1.50
Dump data: 1.00
Things: 1.00
Good: 1.00
wikipedia: 1.00
sentencepiece: 1.00
"I shouldn't, but I won't, so this, so I can."
Phrases like these certainly don't seem to appear much in Wikipedia, so a Wikipedia-based stop list misses them.
Other disappointments
"wikipedia" and "sentencepiece" now score lower (1.00 each).
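For the MeCab runs the only change should be the tokenizer: feed MeCab surface forms into the same rake_scores sketch above. A rough version using the mecab-python3 binding; the dictionary is left at the default and the variable names are placeholders, which may differ from the actual setup.
code::
import MeCab

tagger = MeCab.Tagger()  # default dictionary; the actual one is not stated

def mecab_tokens(text):
    """Surface forms from MeCab's per-morpheme output (one morpheme per line)."""
    tokens = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS":
            break
        tokens.append(line.split("\t")[0])  # surface form is before the tab
    return tokens

# e.g. rake_scores([mecab_tokens(doc)], stopwords_1000) with the 1000-document stop list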
Mecab / 1000 document stoplist / use average phrase-character-length instead of average phrase-token-length
code::
I shouldn't have: 15.00
sentencepiece: 13.00
Link notation: 10.00
1 token: 10.00
Not done: 9.00
wikipedia: 9.00
So here it is: 8.00
Possible: 8.00
Pretreatment: 6.00
Dump data: 6.00
NAME: 5.00
but: 4.00
Hand: 2.00
Things: 2.00
Good: 2.00
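In standard RAKE, degree(w)/freq(w) is the average token length of the candidate phrases containing w; this variant seems to swap in the average character length of those phrases, which is consistent with the 13-character "sentencepiece" scoring 13.00 and the 9-character "wikipedia" scoring 9.00 above. A sketch of that change under the same assumptions as the rake_scores sketch earlier; treat it as a reconstruction, not the exact code used.
code::
from collections import defaultdict

def rake_scores_charlen(token_lists, stopwords):
    """RAKE variant: score each word by the average character length of the
    phrases containing it, instead of the average token length."""
    # Candidate phrases: runs of non-stopword tokens, as in standard RAKE.
    phrases = []
    for tokens in token_lists:
        current = []
        for t in tokens:
            if t in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(t)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)
    char_degree = defaultdict(int)  # total characters of phrases containing w
    for phrase in phrases:
        phrase_chars = sum(len(w) for w in phrase)
        for w in phrase:
            freq[w] += 1
            char_degree[w] += phrase_chars

    # Word score is now the average phrase character length instead of token length.
    return {p: sum(char_degree[w] / freq[w] for w in p) for p in set(phrases)}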
---
This page is auto-translated from /nishio/RAKE実験1. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.